Closes #169 #405

Open
wants to merge 6 commits into
base: main

Conversation

giyaseddin

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

f"QATestSetMedQrels_judged_answers": f"{_DATA_PATH}/QA-TestSet-LiveQA-Med-Qrels-2479-Answers.zip",
}

_SUPPORTED_TASKS = [Tasks.QUESTION_ANSWERING] # TODO: shall we add a non-existing task type such as `RQE`?
Author

The issue description says the dataset supports QA and RQE. Is it enough to set _SUPPORTED_TASKS this way?

Collaborator

Mmm... I think we could get away with using constants.Tasks.TEXTUAL_ENTAILMENT

Collaborator

Quoting from the README:

We included additional annotations in the XML files, that could be used for diverse IR and NLP tasks, such as the question type, the question focus, its synonyms, its UMLS Concept Unique Identifier (CUI) and Semantic Type.

So it seems it might be possible to add NAMED_ENTITY_RECOGNITION/DISAMBIGUATION to this, but I'm not sure...

Author

The data provided in the XML files doesn't seem to be structured for NER, e.g. this sample
I'll take another look to see if I could possibly parse them.
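To make the concern concrete, here is a rough sketch of why document-level annotations are not directly usable as NER spans. The XML below is an invented illustration: the element names, attribute names, and the CUI and semantic type values are assumptions or placeholders, not copied from the dataset files. The point is only that a focus term and its CUI attached to a whole document carry no character offsets, so spans would have to be recovered by string matching.

```python
import xml.etree.ElementTree as ET

# Illustrative document only: the layout and all values are assumptions.
SAMPLE_XML = """
<Document id="0000001" source="EXAMPLE">
  <Focus>Glaucoma</Focus>
  <FocusAnnotations>
    <UMLS>
      <CUIs><CUI>C0000000</CUI></CUIs>
      <SemanticTypes><SemanticType>T000</SemanticType></SemanticTypes>
    </UMLS>
  </FocusAnnotations>
  <QAPairs>
    <QAPair pid="1">
      <Question qid="0000001-1" qtype="information">What is Glaucoma?</Question>
      <Answer>Glaucoma is a group of diseases.</Answer>
    </QAPair>
  </QAPairs>
</Document>
"""


def focus_spans(xml_text):
    """Locate the document-level focus term inside each question by string search."""
    root = ET.fromstring(xml_text)
    focus = root.findtext("Focus")
    cuis = [cui.text for cui in root.iter("CUI")]
    spans = []
    for question in root.iter("Question"):
        start = question.text.find(focus)  # -1 when the focus is not mentioned verbatim
        if start != -1:
            spans.append((question.text, start, start + len(focus)))
    return {"focus": focus, "cuis": cuis, "spans": spans}


result = focus_spans(SAMPLE_XML)
print(result)
```

Exact string matching like this would miss synonyms and inflected forms, which is one reason an NER schema may not be a good fit without further work.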


_HOMEPAGE = "https://github.com/abachaa/MedQuAD"

_LICENSE = "https://creativecommons.org/licenses/by/4.0/legalcode" # TODO: terms aren't available in the repository! In the issue it is 'CC BY 4.0'
Author

Also, the issue description lists CC BY 4.0 as the license type, so I searched for the license terms online. I just wanted to make sure this is correct.

Collaborator

Where did you find the license of the dataset? I cannot seem to find it...

Author

If you check the license field in the description of #169, you will see it, although I couldn't find it mentioned in the repo.
I'm not sure, but I assumed the license in the description is correct, looked for "terms of CC BY 4.0" and pasted the link :)
What do you think it should be @sg-wbi ?

@giyaseddin
Author

This PR closes #169

Collaborator

@sg-wbi left a comment

@giyaseddin Thank you very much for your contribution! Oh my, this seems to be a nasty one... Could you please check my comments?
I would like to help you out more, but first we should get the dataloader to download the data in a reasonable amount of time/steps and see what we've got. Thanks!


biodatasets/medquad/medquad.py

qa_pairs_enriched_fpath = self._dump_xml_to_json(dl_manager)

# There is no canonical train/valid/test set in this dataset. So, only TRAIN is added.
Collaborator

Author

I added this set, but its general scheme doesn't match, so I implemented it to barely fit the schema.

@sg-wbi sg-wbi self-assigned this Apr 11, 2022
@sg-wbi
Collaborator

sg-wbi commented Apr 14, 2022

Thank you for checking out my comments @giyaseddin ! I am trying to inspect the dataset with:

from datasets import load_dataset
ds = load_dataset("biodatasets/medquad/medquad.py", "medquad_source")

But I get this error:

    142         raise NotImplementedError("Only `source` and `bigbio_qa` schemas are implemented.")
    144     return datasets.DatasetInfo(
    145         description=_DESCRIPTION,
    146         features=features,
   (...)
    149         citation=_CITATION,
    150     )
--> 152 def _load_qa_from_xml(self, file_paths) -> List[dict[str, str | None]]:
    153     """
    154     This method traverses the whole list of the downloaded XML files and extracts Q&A pairs.
    155     Returns the extracted Q&As and the base directory of the dumped json file that contains them all.
    156     """
    157     assert len(file_paths)

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Could you please make sure we can load both source and bigbio w/o errors? Thank you!
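The `TypeError` above is what Python versions before 3.10 raise when the annotation `str | None` is evaluated at function definition time. A minimal sketch of one possible fix, assuming the loader has to run on such a version, is to spell the return type with `typing.Optional` (and `typing.Dict` instead of the built-in `dict[...]`). The function below is a hypothetical stand-in for the method in the traceback, not the PR's actual implementation.

```python
from typing import Dict, List, Optional


def load_qa_from_xml(file_paths: List[str]) -> List[Dict[str, Optional[str]]]:
    """Hypothetical stand-in: return one record per path, with an answer that may be None."""
    assert len(file_paths)
    # Optional[str] is equivalent to `str | None` but also works before Python 3.10.
    return [{"path": path, "answer": None} for path in file_paths]


records = load_qa_from_xml(["a.xml", "b.xml"])
print(records)
```

Another option is to put `from __future__ import annotations` at the top of the module, which defers evaluation of annotations so the `|` expression is never executed at definition time.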

@regel-corpus

Hey @giyaseddin! Do you plan to keep working on this?

@giyaseddin
Author

Hey @regel-corpus,
I will push my last modifications ASAP.

@giyaseddin
Author

Could you please check whether the current version downloads correctly, @sg-wbi?

@rosalinesway

Hi @giyaseddin, I pulled the latest code, and it seems like this error still occurs upon loading. Could you check again if you have fixed it in your updates?

@ruisi-su
Collaborator

Hi @giyaseddin, thanks for putting in the effort to continue working on this dataset. Would it be possible to pull the up-to-date master into your branch? There are some inconsistencies between your branch and master, which block running the unit tests. Thanks!
